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This  paper  presents  a  divide-and-conquer  ray-traced  volume  rendering  algorithm  and  a  parallel  image 
compositing  method,  along  with  their  implementation  and  performance  on  the  Connection  Machine 
CM-5,  and  networked  workstations.  This  algorithm  distributes  both  the  data  and  the  computations  to 
individual  processing  units  to  achieve  fast,  high-quality  rendering  of  high- resolution  data.  The  volume 
data,  once  distributed,  is  left  intact.  The  processing  nodes  perform  local  ray  tracing  of  their  subvolume 
concurrently.  No  communication  between  processing  units  is  needed  during  this  locally  ray-tracing 
process.  A  subimage  is  generated  by  each  processing  unit  and  the  final  image  is  obtained  by  compositing 
subimages  in  the  proper  order,  which  can  be  determined  a  priori.  Test  results  on  both  the  CM-5  and  a 
group  of  networked  workstations  demonstrate  the  practicality  of  our  rendering  algorithm  and  compositing 
method. 
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1  Introduction 


Existing  volume  rendering  methods,  though  capable  of  making  very  effective  visualizations,  are  very 
computationally  intensive  and  therefore  fail  to  achieve  interactive  rendering  rates  for  large  data  sets. 
Although  the  computing  technology  continues  to  advance,  the  increase  in  computer  processing  power 
has  never  seemed  to  catch  up  with  the  increase  in  data  size.  Our  work  was  motivated  by  the  following 
observations:  First,  volume  data  sets  can  be  quite  large,  often  too  large  for  a  single  processor  machine 
to  hold  in  memory  at  once.  Moreover,  high  quality  volume  renderings  normally  take  minutes  to  hours 
on  a  single  processor  machine  and  the  rendering  time  usually  grows  linearly  with  the  data  size.  To 
achieve  interactive  rendering  rates,  users  often  must  reduce  the  original  data,  which  produces  inferior 
visualization  results.  Second,  many  acceleration  techniques  and  data  exploration  techniques  for  volume 
rendering  trade  memory  for  time,  which  results  in  another  order  of  magnitude  increase  in  memory  use. 
Third,  motion  is  one  the  most  effective  visualization  techniques,  but  an  animation  sequence  of  volume 
visualization  normally  takes  hours  to  days  to  generate.  Finally,  we  notice  the  availability  of  massively 
parallel  computers  and  the  hundreds  of  high  performance  workstations  in  our  computing  environment. 
These  workstations  are  frequently  sitting  idle  for  many  hours  a  day.  All  the  above  lead  us  to  investigate 
ways  of  distributing  the  increasing  amount  of  data  as  well  as  the  time-consuming  rendering  process  to 
the  tremendous  distributed  computing  resources  available  to  us. 

In  this  paper,  we  describe  the  resulting  parallel  volume  rendering  algorithm,  which  consists  of  two 
parts:  parallel  ray-tracing  and  parallel  compositing.  In  our  current  implementation  on  the  CM-5  and 
networked  workstations,  the  parallel  volume  Tenderer  evenly  distributes  data  to  the  computing  resources 
available.  Without  the  need  to  communicate  with  other  processing  units,  each  subvolume  is  ray-traced 
locally  and  generates  a  partial  image.  The  parallel  compositing  process  then  merges  all  resulting  partial 
images  in  depth  order  to  achieve  the  complete  image.  The  compositing  process  is  particularly  effective  for 
massively  parallel  processing  as  it  always  makes  use  of  all  processing  units  by  continuously  subdividing 
the  partial  images  and  distributing  them  to  each  processing  unit.  Our  test  results  on  both  the  CM-5 
and  workstations  are  promising,  and  expose  different  performance  tuning  issues  for  each  platform. 

2  Previous  Work 

An  increasing  number  of  parallel  architectures  and  algorithms  for  volume  rendering  have  been  developed. 
The  major  algorithmic  strategy  for  parallelizing  volume  rendering  is  the  divide-and-conquer  paradigm. 
The  volume  rendering  problem  can  be  subdivided  either  in  data  space  or  in  image  space.  While  data- 
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space  subdivision  assigns  the  computation  associated  with  particular  subvolumes  to  processors,  image- 
space  subdivision  distributes  the  computation  associated  with  particular  portions  of  the  image  space. 
Data-space  subdivision  is  usually  applied  to  a  distributed-memory  parallel  computing  environment. 
On  the  other  hand,  image-space  subdivision  is  simple  and  efficient  for  shared-memory  multiprocessing. 
Hybrid  methods  are  also  feasible. 

Among  the  parallel  architectures  developed  which  are  capable  of  performing  interactive  volume  render¬ 
ing,  the  Pixel- Planes  5  system  [5]  is  a  heterogeneous  multiprocessor  graphics  system  using  both  MIMI) 
and  SIMD  parallelism.  The  hardware  consists  of  multiple  i860-based  Graphics  Processors,  multiple 
SIMD  pixel-processors  arrays  called  Renderers,  and  a  conventional  1 280 x  1024- pixel  frame  buffer,  inter¬ 
connected  by  a  five-gigabit  ring  network.  In  [21],  variations  of  paiodl  1  volume  rendering  implemented 
on  this  system  are  presented.  One  approach  similar  to  the  idea  we  proposed  earlier  in  [12]  and  now 
elaborate  in  this  paper,  distributes  data  as  well  as  ray  casting  among  separate  Graphics  Processors  and 
reconstructs  the  ray  segments  into  coherent  rays.  Incorporating  dynamic  load  balancing,  lookup  tables 
and  progressive  refinement,  this  approach  can  render  shaded  images  from  128x128x56  volume  data 
at  20  frames  per  second.  In  the  following  sections,  we  survey  most  recent  research  results  from  other 
algorithmic  approaches. 

2.1  Montani 

Montani  et  al.  [14]  propose  a  hybrid  ray-traced  method  for  running  on  distributed-memory  parallel 
systems  like  a  nCUBE,  in  which  processing  nodes  are  organized  into  a  set  of  clusters ,  each  of  them 
composed  of  the  same  number  of  nodes.  The  image  space  is  partitioned  and  a  subset  of  pixels  is  assigned 
to  each  cluster,  which  will  compute  pixel  values  independently.  Data  to  be  visualized  is  replicated  in 
each  cluster,  and  is  partitioned  among  the  local  memory  of  the  cluster’s  nodes.  A  static  load  balancing 
strategy  based  on  estimated  work  load  of  each  processor  is  used  to  improve  efficiency,  and  on  average 
a  twenty  percent  speedup  in  rendering  time  can  be  obtained.  In  addition,  a  mechanism  for  preventing 
deadlock  is  necessary  to  handle  the  dependency  between  processing  nodes  in  the  same  cluster.  The 
best  efficiency  reported  by  the  authors  while  using  a  single  cluster  of  128  nodes  is  0.74.  However,  when 
increasing  the  number  of  clusters,  the  efficiency  drops  significantly.  For  example,  using  16  clusters  with 
8  nodes  per  cluster,  the  efficiency  reported  is  only  0.31. 
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2.2  Nieh 


Nieh  •'nd  Levny  [15]  implement  ray-traced  volume  rendering  on  Stanford  DASH  Multiprocessors,  a 
scalable  shared-memory  MIMD  machine.  Their  method  employs  algorithmic  optimizations  such  as 
hierarchical  opacity  enumeration,  early  ray  termination,  and  adaptive  image  sampling  [9],  The  shared- 
memory  architecture  providing  a  single  address  space  allows  straightforward  implementations.  The 
parallel  algorithm  distributes  volume  data  in  an  interleaved  fashion  among  the  local  memories  to  avoid 
hot  spotting.  The  ray  tracing  computation  is  distributed  among  the  processors  by  partitioning  the 
image  plane  into  contiguous  blocks  and  each  processor  is  statically  assigned  an  image  block.  Each 
block  is  further  divided  into  square  ima^e  tiles  for  load  balancing  purposes.  When  a  processor  is  done 
computing  its  block,  instead  of  waiting,  it  steals  tiles  from  a  neighboring  processor’s  block  to  keep  itself 
busy.  Experiment  results  show  this  load  balancing  scheme  cuts  the  variation  of  execution  times  across 
the  48  processors  used  by  90%.  Currently,  each  processor  in  DASH  is  a  33  MHz  MIPS  R3000.  Using  all 
48  processors  available,  a  416x416-pixel  image  for  a  2563  data  set  can  be  generated  in  subseconds;  for 
nonadaptive  sampling,  the  speedup  over  uniprocessor  rendering  is  40. 

2.3  Schroder 

Schroder  and  Stoll  [19]  develop  a  data-parallel  ray-traced  volume  rendering  algorithm  that  exploits 
ray  parallelism.  They  describe  the  ray  tracing  steps  as  discrete  line  drawing.  This  algorithm  is  both 
more  memory  efficient  and  less  communications  bound  than  an  algorithm  introduced  earlier  by  the  first 
author  [18].  They  have  implemented  this  algorithm  on  both  the  Connection  Machine  CM-2  and  the 
Princeton  Engine,  which  consists  of  2048  16-bit  DSP  processors  arranged  in  a  ring.  To  allow  for  a  SIMD 
implementation,  rays  initially  enter  only  the  front-most  face  of  the  volume  and  proceed  in  lock  step. 
Consequently,  each  sample  has  the  same  local  coordinates  in  a  voxel.  When  rays  exit  the  far  face,  a 
toroidal  shift  of  the  data  is  performed  and  new  rays  are  initialized  to  enter  the  visible  side  face  of  the 
volume.  As  a  result,  the  rotation  angle  selected  influences  about  10%  of  the  runtime  of  the  algorithm. 
Tests  using  a  1283-voxel  data  set  on  both  the  CM2  from  8K  to  32K  processors  in  size  and  the  Princeton 
Engine  of  1024  processors  show  subsecond  rendering  time. 

2.4  Vezina 

Vezina,  et  id.  [22]  implement  a  multi-pass  algorithm  similar  to  Schroder’s  on  MP-1,  which  is  a  massively 
data-parallel  SIMD  computer  with  a  2D  array  of  processing  elements  (PEs).  Their  algorithm,  based 
on  work  done  by  Catmull  and  Smith  [2],  and  Hanrahan  [7],  converts  both  3D  rotation  and  perspec- 
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tive  transformations  into  only  four  ID  shear/scale  passes,  compared  to  Schroder’s  eight-pass  rotation 
algorithm  composed  exclusively  of  shear  operations.  Volume  transposition  is  then  performed  to  localize 
data  access.  MP-1  provides  a  global  router  which  allows  efficient  moving  of  data  between  PEs.  On  a 
16K-PE  MP-1,  a  128x  128-pixel  volume  rendered  image  of  a  1283-voxel  data  can  be  generated  in  sub¬ 
seconds.  However,  it  seems  that  if  either  a  smaller  number  of  PEs  or  larger  data  sets  are  used,  the  data 
transposition  time  can  degrade  the  peiformance  significantly. 

3  A  Divide-and-Conquer  Algorithm 

The  idea  behind  our  algorithm  is  very  simple:  divide  the  data  up  into  smaller  subvolumes  distributed 
to  multiple  computers,  render  them  separately  and  locally ,  and  combine  the  resulting  images  in  an 
incremental  fashion.  While  multiple  computers  are  available,  the  memory  demands  on  each  computer 
are  modest  since  each  computer  need  only  hold  a  subset  of  the  total  data  set.  This  approach  can  be  used 
to  render  high  resolution  data  sets  in  an  environment,  for  example,  with  many  midrange  workstations 
( e.g .  equipped  with  16MB  memory)  on  a  local  area  network.  Many  computing  environments  have 
an  abundance  of  such  workstations  which  could  be  harnessed  for  volume  rendering  provided  that  the 
memory  usage  on  each  machine  is  reasonable. 

3.1  Ray- Traced  Volume  Rendering 

The  starting  point  of  our  algorithm  is  the  volume  ray-tracing  technique  presented  by  Levoy  [8],  An  image 
is  constructed  in  image  order  by  casting  rays  from  the  eye  through  the  image  plane  and  into  the  volume 
of  data.  One  ray  per  pixel  is  generally  sufficient,  provided  that  the  image  sample  density  is  higher  than 
the  volume,  data  sample  density.  Using  a  discrete  rendering  model,  the  data  volume  is  sampled  at  evenly 
spaced  points  along  the  ray,  usually  at  a  rate  of  one  to  two  samples  per  voxel.  At  each  sample  point  on 
the  ray,  a  color  and  an  opacity  are  computed  using  trilinear  interpolation  from  the  data  values  at  each 
of  the  eight  nearest  voxels. 

The  color  is  assigned  by  applying  a  shading  function  such  as  the  Phong  lighting  mode],  A  color  map 
is  often  used  to  assign  colors  to  the  raw  data  values.  The  normalized  gradient  of  the  data  volume  can 
be  used  as  the  surface  normal  for  shading  calculations.  The  opacity  is  derived  by  using  the  interpolated 
voxel  values  as  indices  into  an  opacity  map.  Sampling  continues  until  the  data  volume  is  exhausted  or 
until  the  accumulated  opacity  reaches  a  threshold  cut-off  value.  The  final  image  value  corresponding  to 
each  ray  is  formed  by  compositing,  front-to-back,  the  colors  and  opacities  of  the  sample  points  along  t  he 
ray.  The  color/opacity  compositing  is  based  on  Porter  and  Duff’s  over  operator  [17].  It  is  easy  to  verify 
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that  the  over  is  associative;  that  is, 


a  over  (6  over  c)  =  (a  over  b)  over  c. 

The  associativity  of  the  over  operator  allows  us  to  break  a  ray  up  into  segments,  process  the  sampling 
and  compositing  of  each  segment  independently,  and  combine  the  results  from  each  segment  via  a  final 
compositing  step.  This  is  the  basis  for  our  parallel  volume  rendering  algorithm. 

3.2  Data  Subdivision/Load  Balancing 

The  divide-and-conquer  algorithm  requires  that  we  partition  the  input  data  into  subvolumes.  There  are 
many  ways  to  partition  the  data;  the  only  requirement  is  that  an  unambiguous  front-to-back  ordering 
can  be  determined  for  the  subvolumes  to  establish  the  required  order  for  compositing  subimages.  Ideally 
we  would  like  each  subvolume  to  require  about  the  same  amount  of  computation.  In  practice,  this 
is  generally  not  something  that  we  can  always  control  well.  For  example,  if  the  viewpoint  is  known 
and  fixed,  we  could  partition  the  volume  in  a  manner  that  minimizes  the  overlap  between  the  images 
resulting  from  the  subvolumes.  This  will  reduce  the  cost  of  the  merging  since  compositing  need  only 
be  applied  where  subimages  overlap  as  shown  later.  For  an  animation  sequence,  this  technique  can  not 
be  applied  since  the  viewpoint  changes  with  each  frame.  We  can  also  partition  the  volume  based  on 
an  estimation  of  the  distribution  of  the  amount  of  computation  within  the  volume  by  preprocessing  the 
volume  to  identify  high  gradient  regions  or  empty  regions.  In  addition,  we  may  partition  and  distribute 
the  volume  according  to  the  performance  of  individual  computers  when  using  a  heterogeneous  computing 
environment. 

The  simplest  method  is  probably  to  partition  the  volume  along  planes  parallel  to  the  coordinate  planes 
of  the  data.  Again,  if  the  viewpoint  is  fixed  and  known  when  partitioning  the  data,  the  coordinate  plane 
most  nearly  orthogonal  to  the  view  direction  can  be  determined  and  the  data  can  subdivided  into  “slices” 
orthogonal  to  this  plane.  When  orthographic  projection  is  used,  this  will  tend  to  produce  subimages 
with  little  overlap.  If  the  view  point  is  not  known,  or  if  perspective  projection  is  used,  it  is  better  to 
partition  the  volume  equally  along  all  coordinate  planes.  This  can  be  accomplished  using  a  k-D  tree 
structure  [1],  with  alternating  binary  subdivision  of  the  coordinate  planes  at  each  level  in  the  tree  as 
indicated  in  Figure  1.  As  shown  later,  this  structure  provides  a  nice  mechanism  for  image  compositing. 

As  shown  in  Figure  2,  when  a  volume  of  grid  points  (voxels)  is  evenly  subdivided  into,  for  example, 
two  subvolumes,  each  subvolume  may  contain  half  of  the  total  grid  points.  Note  that  each  voxel  is 
located  at  a  corner  of  the  grid.  Consequently,  those  ray  samples  that  lie  in  the  cut  boundary  region  (the 
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Figure  1:  k-Dtree  Subdivision  of  a  Data  Volume 


dotted  region)  are  lost.  If  the  view  vector  is  parallel  to  the  cut  plane,  a  black  strip  will  appear  at  each 
cut  boundary  in  the  composited  image.  In  order  to  avoid  this  problem,  we  need  to  replicate  one  layer 
of  the  boundary  grid  at  each  subvolume  so  the  composited  ray-casting  image  does  not  drop  out  features 
originally  in  the  volume.  For  the  case  shown  in  Figure  2,  one  possible  arrangement  is  that  Subvolume  1 
includes  layer  1  to  layer  k  and  Subvolume  2  includes  layer  k  to  layer  n;  that  is,  in  Subvolume  2,  layer  k 
is  replicated. 

3.3  Parallel  Rendering 

We  use  ray-casting  based  volume  rendering.  Each  computer  can  perform  raytracing  independently;  that 
is,  there  is  no  data  communication  required  during  the  subvolume  rendering.  All  subvolumes  are  rendered 
using  an  identical  view  position  and  only  rays  within  the  image  region  covering  the  corresponding 
subvolume  are  cast  and  sampled.  Since  we  sample  along  each  ray  at  a  predetermined  interval,  consistent 
sampling  locations  must  be  ensured  for  all  subvolumes  so  we  can  reconstruct  the  original  volume.  As 
shown  in  Figure  3,  for  example,  the  location  of  the  first  sample  52(1)  on  the  ray  shown  in  Subvolume 
2  should  be  calculated  correctly  so  that  the  distance  between  52(1)  and  5i(n)  is  equivalent  to  the 
predetermined  interval.  Otherwise,  small  features  in  the  data  might  be  lost  or  enhanced  in  an  erroneous 
way. 
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Figure  2:  Volume  Boundary  Replication. 


3.4  Image  Composition 

The  final  step  of  our  algorithm  is  to  merge  ray  segments  and  thus  all  partial  images  into  the  final  total 
image.  In  order  to  merge,  we  need  to  store  not  only  the  color  at  each  pixel  but  also  the  accumulated 
opacity  there.  As  described  earlier,  the  rule  for  merging  subimages  is  based  on  the  over  compositing 
operator.  When  all  subimages  are  ready,  they  are  composited  in  a  front-to-back  order.  For  a  straight¬ 
forward  one-dimensional  data  partition,  this  order  is  also  straightforward.  When  using  the  k-1)  tree 
structure,  this  front-to-back  image  compositing  order  can  then  be  determined  hierarchically  by  a  recur 
sive  traversal  of  the  k-D  tree  structure,  visiting  the  “front”  child  before  the  “back”  child.  This  is  similar 
to  well  known  front-to-back- traversals  of  BSP-trees  [4]  and  octrees  [3].  In  addition,  the  hierarchital 
structure  provides  a  natural  way  to  accomplish  the  compositing  in  parallel:  sibling  nodes  in  the  tree 
may  be  processed  concurrently; 

A  naive  approach  for  merging  the  partial  images  is  to  do  binary  compositing.  By  pairing  up  computers 
in  order  of  compositing,  each  disjoint  pair  produces  a  new  subimage.  Thus  after  the  first  stage,  we  are 
left  with  the  task  of  compositing  only  |  subimages.  Then  we  use  half  the  number  of  the  original 
computers,  and  pair  them  up  for  the  next  level  compositing.  Continuing  similarly,  after  log  n  stages, 
the  final  image  is  obtained.  One  problem  for  the  above  methods  is  that  during  the  compositing  process 
compositing,  many  computers  become  idle.  At  the  top  of  the  tree,  only  one  processor  is  active,  doing 
the  final  composite  for  the  entire  image.  When  running  on  a  massively  parallel  computer  like  CM-.1) 
with  thousands  of  processors,  this  would  significantly  affect  the  overall  performance;  consequently,  the 
compositing  process  would  become  a  bottleneck  when  interactive  rendering  rates  are  desired.  To  avoid 
this  problem,  we  have  generalized  the  binary  compositing  method  so  that  every  processor  participates 
in  all  the  stages  of  the  compositing  process.  We  call  the  new  scheme  binary-swap  compositing.  The  key 
idea  is  that,  at  each  compositing  stage,  the  two  processors  involved  in  a  composite  operation  split  the 
image  plane  into  two  pieces  and  each  processor  takes  responsibility  for  one  of  the  two  pieces. 

In  the  early  phases  of  the  algorithm,  each  processor  is  responsible  for  a  large  portion  of  the  image 
area,  but,  the  image  area  is  usually  sparse  since  it  includes  con,T-'butions  only  from  a  few  processors. 
In  later  phases,  as  we  move  up  the  compositing  tree,  the  processors  are  responsible  for  a  smaller  and 
smaller  portion  of  the  image  area,  but  the  sparsity  decreases  since  an  increasing  number  of  processors 
have  contributed  image  data.  At  the  top  of  the  tree,  all  processors  have  complete  information  for  a  small 
rectangle  of  the  image.  The  final  image  can  be  constructed  by  tiling  these  subimages  onto  the  display. 

Figure  4  illustrates  the  binary-swap  compositing  algorithm  graphically  for  four  processors.  When  all 
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four  computers  finish  ray-tracing  locally,  each  computer  holds  a  partial  image,  as  depicted  in  (a).  Then 
each  partial  image  is  subdivided  into  two  half-images  by  splitting  along  the  X  axis.  In  our  example,  as 
shown  in  (b),  Computer  1  keeps  only  the  left  half-image  and  sends  its  right  half-image  to  its  immediate- 
right  sibling,  which  is  Computer  2.  r .  tersely,  Computer  2  keeps  its  right  half-image,  and  sends  its 
left  half-image  to  Computer  1.  Butn  computers  then  composite  the  half  image  they  keep  with  the  half 
image  they  receive.  A  similar  exchange  and  compositing  of  partial  images  is  done  between  Computer 

3  and  4.  After  the  ;:.rst  stage,  each  computer  only  holds  a  partial  image  that  is  half  the  size  of  the 
original  one.  In  the  next  stage,  Computer  1  alternates  the  image  subdivision  direction.  This  time  it 
keeps  the  upper  half-image  and  sends  the  lower  half-image  to  its  second-immediate-right  sibling,  which 
is  Computer  3,  as  shown  in  (c).  Conversely,  Computer  3  trades  its  upper  half-image  for  Computer  1's 
lower  half-image  for  compositing.  Concurrently,  a  similar  exchange  and  compositing  between  Computer 
2  and  4  are  done.  After  this  stage,  each  computers  hold  only  one-fourth  of  the  original  image.  For  this 
example,  we  are  done  and  each  computer  sends  its  image  to  the  display  device.  The  final  composited 
image  is  shown  in  (d).  It  has  been  brought  to  our  attention  that  a  similar  merging  algorithm  has  been 
developed  independently  by  Mackerras  [13]. 

Figure  5  shows  the  psuedo  code  of  the  same  compositing  algorithm  when  the  number  of  processors 
(nproc)  is  a  perfect  power  of  tw'o.  We  assume  that  processors  are  numbered  from  0  to  nproc-1  and 
that  self  is  an  integer  containing  the  current  processor  number.  There  are  logjlnproc)  phases  ami  a 
phase  corresponding  to  each  level  in  the  compositing  tree.  During  each  phase,  each  processor  exchanges 
data  with  its  partner  that  is  stride  away  from  it.  The  stride  value  steps  from  1  up  to  in  powers 

of  2. 

In  our  current  implementation,  the  number  of  processors  (nproc)  must  be  a  perfect  power  of  two.  This 
simplifies  the  calculations  needed  to  identify  the  compositing  partner  at  each  stage  of  the  compositing  tree 
and  ensures  that  all  processors  are  active  at  every  compositing  phase.  The  algorithm  can  be  generalized 
to  relax  this  restriction  if  the  compositing  tree  is  kept  as  a  full  (but  not  necessarily  complete)  binary 
tree,  with  some  additional  complexity  in  the  compositing  partner  compulation  and  with  some  processors 
remaining  idle  during  the  first  compositing  phase. 

4  Implementation  of  the  Renderer 

We  have  implemented  two  versions  of  our  distributed  volume  rendering  algorithm:  one  on  the  CM-5  and 
another  on  groups  of  networked  workstations.  Our  implementation  is  composed  of  three  major  pieces 
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Figure  4:  Parallel  Compositing  Process. 
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Initialize  image  plane  to  entire  image; 
for(stride=l;  strideCnproc;  stride  *  =  2) 
{ 

partner  -=  self  XOR  stride; 

Subdivide  image  plane; 

Exchange  image  data  with  partner; 

Composite  our  part  of  the  remaining 
image  plane  with  partners  image  data; 

} 


Figure  5:  Psuedo  Code  for  Binary-Swap  Compositing 

of  code:  a  data  distributor,  a  Tenderer,  and  an  image  compositor.  Currently,  the  data  distributor  is  a 
part  of  the  host  program  which  reads  data  piece  by  piece  from  disk  and  distributes  to  each  machine 
participating.  Alternatively,  each  node  program  could  read  their  piece  from  disk  directly. 

The  Tenderer  implements  a  conventional  ray-traced  volume  rendering  algorithm  [8]  using  a  Phong 
lighting  model  [16].  Our  Tenderer  is  a  basic  Tenderer  and  is  not  highly  tuned  for  best  performance. 
Compared  to  a  performance  tuned  ray-traced  volume  rendering  program  we  implemented  previously  [10], 
we  estimate  that  the  current  implementation  of  the  Tenderer  can  be  further  improved  in  speed  by  10-15%. 
in  fact,  data  dependent  optimization  methods  might  affect  load  balancing  decisions  by  accelerating  the 
progress  on  some  processors  more  than  others.  For  example,  a  processor  tracing  through  empty  space 
will  probably  finish  before  another  processor  working  on  a  dense  section  of  the  data.  We  are  currently 
exploring  data  distribution  heuristics  that  can  take  the  complexity  of  the  subvolumes  into  account  when 
distributing  the  data  to  ensure  equal  load  on  all  processors. 

For  shading  the  volume,  surface  normals  are  approximated  as  local  gradients  using  central  differencing. 
We  trade  memory  for  time  by  precomputing  and  storing  the  three  components  of  the  gradient  at  each 
voxel.  As  an  example,  for  a  data  set  of  size  256x256x256,  more  than  200  megabyte  are  required  to  store 
both  the  data  and  the  precomputed  gradients.  This  memory  requirement  prevents  us  from  sequentially 
rendering  this  data  set  on  most  of  our  workstations. 

4.1  CM-5  and  CMMD  3.0 

The  CM-5  is  a  massively  parallel  supercomputer  which  supports  both  the  SIMD  and  MIMD  programming 
models  [20].  The  CM-5  in  the  Advanced  Computing  Laboratory  at  Los  Alamos  National  Laboratory 
has  1024  nodes,  each  of  which  is  a  Sparc  microprocessor  with  16MB  of  local  RAM  and  four  64-bit  wide 
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vector  units.  With  four  vector  units  up  to  128  operations  can  be  performed  by  a  single  instruction.  This 
yields  a  theoretical  speed  of  128  GFlops  for  a  1024-node  CM-5.  The  nodes  can  be  divided  into  partitions 
whose  size  must  be  a  power  of  two.  A  user’s  program  is  constrained  to  operating  within  a  partition. 
Our  CM-5  implementation  of  the  parallel  volume  Tenderer  takes  advantages  of  the  MIMD  programming 
features  of  the  CM-5.  MIMD  programs  use  CMMD,  a  message  passing  library  for  communications  and 
synchronization,  which  supports  either  a  hostless  model  or  a  host/node  model  [21]. 

We  choose  the  host/node  programming  model  of  CMMD  because  we  wanted  the  option  of  using  X- 
windows  to  display  directly  from  the  CM-5.  The  host  program  determines  which  data-space  partitioning 
to  use,  based  on  the  number  of  nodes  in  the  CM-5  partition,  and  sends  this  information  to  the  nodes.  The 
host  then  optionally  reads  in  the  volume  to  be  rendered  and  broadcasts  it  to  the  nodes.  Alternatively,  the 
data  can  be  read  directly  from  the  DataVault  or  Scalable  Disk  Array  into  the  nodes  local  memory.  The 
hoi .  then  broadcasts  the  opacity/colormap  and  the  transformation  information  to  the  nodes.  Finally, 
the  host  performs  an  I/O  servicing  loop  which  receives  the  rendered  portions  of  the  image  from  the 
nodes. 

The  node  program  begins  by  receiving  its  data-space  partitioning  information  and  then  its  portion  of 
the  data  from  the  host.  It  then  updates  the  transfer  function  and  the  transform  matrices.  Following  this 
step,  the  nodes  all  execute  their  own  copy  of  the  Tenderer.  They  synchronize  after  the  rendering  and 
before  entering  the  compositing  phase.  Once  the  compositing  is  finished,  each  node  has  a  portion  of  the 
image  that  they  then  send  back  to  the  host  for  display. 

4.2  Networked  Workstations  and  PVM  3.1 

Unlike  a  massively  parallel  supercomputer  dedicating  uniform  and  intensive  computing  power,  a  network 
computing  environment  provides  nondedicated  and  scattered  computing  cycles.  Thus,  using  a  set  of  high 
performance  workstations  connected  by  an  Ethernet,  our  goal  is  to  set  up  a  volume  rendering  facility  for 
handling  large  data  sets  and  batch  animation  jobs.  That  is,  we  hope  that  by  using  many  workstations 
concurrently,  the  rendering  time  will  decrease  linearly  and  we  will  be  able  to  render  data  sets  that  are 
too  large  to  render  on  a  single  machine.  Note  that  real-time  rendering  is  generally  not  achievable  in  such 
environment. 

We  use  PVM  (Parallel  Virtual  Machine)  [6],  a  parallel  program  development  environment,  to  imple¬ 
ment  the  data  communications  in  our  algorithm.  PVM  allows  us  to  implement  our  algorithm  port  ably 
for  use  on  a  variety  of  workstation  platforms.  To  run  a  program  under  PVM.  the  user  first  executes  a 
daemon  process  on  the  local  host  machine,  which  in  turn  initiates  daemon  processes  on  all  other  remote 
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machines  used.  Then  the  user’s  application  program  (the  node  program),  which  should  reside  on  each 
machine  used,  can  be  invoked  on  each  remote  machine  by  a  local  host  program  via  the  daemon  pro¬ 
cesses.  Communication  and  synchronization  between  these  user  processes  are  controlled  by  the  daemon 
processes,  which  guarantee  reliable  delivery. 

A  host/node  model  has  also  been  used.  As  a  result,  the  way  it  has  been  implemented  is  very  similar  to 
that  of  CM-5’s.  In  fact,  the  only  distinct  difference  between  the  workstation’s  and  CM-5’s  implementation 
(source  program)  is  the  communication  calls.  Basically,  for  most  of  the  basic  communication  functions, 
PVM  3.1  and  CMMD  3.0  have  one-to-one  equivalence. 

5  Tests 

We  used  three  different  data  sets  for  our  tests.  The  vorticity  data  set  is  a  256x256x256  voxel  CFD  data 
set,  computed  on  a  CM-200,  showing  the  onset  of  turbulence.  The  head  data  set  is  the  now  classic  UNC 
Chapel  Hill  CT  head  at  a  size  of  128x128x128.  The  vessel  data  set  is  a  256x256x128  voxel  Magnetic 
Resonance  Angiography  (MRA)  data  set  showing  the  vascular  structure  within  the  brain  of  a  patient. 
All  test  results  are  presented  graphically  with  the  discussion.  Tables  of  actual  numbers  are  placed  in 
Appendices  to  this  paper  for  interested  readers.  Figure  6  illustrates  the  compositing  process  described  in 
Figure  4,  using  the  images  generated  with  the  vessel  data  set.  Similarly,  each  column  shows  the  images 
from  one  processor,  while  the  rows  are  the  phases  of  the  compositing  algorithm.  The  final  image  is 
displayed  at  the  bottom. 

5.1  CM-5 

We  performed  multiple  experiments  on  the  CM-5  using  partition  sizes  of  32,  64,  128,  256  and  512.  When 
these  tests  were  run,  a  1024  partition  was  not  available.  Timing  results  are  shown  in  Figure  7,  9  and  1 1  for 
the  head,  vessel  and  vorticity  data  sets  respectively.  The  times  shown  in  graphs  (and  tables  in  Appendix 
A)  are  the  maximum  times  for  all  the  nodes  for  the  two  steps  of  the  core  algorithm:  the  rendering  step 
and  the  compositing  step.  All  times  are  given  in  seconds.  The  corresponding  speedup  graphs  are  shown 
in  Figure  8,  10,  and  12.  Note  that  the  speedup  was  also  measured  for  the  core  algorithm  and  it  is  a 
function  of  the  32  node  running  time. 

Looking  at  Figure  7,  9  and  1 1,  it  is  easy  to  see  that  rendering  time  dominates  the  process.  It  should  be 
noted  that  this  implementation  does  not  take  advantage  of  the  CM-5  vector  units.  We  expect  much  faster 
computation  rates  in  the  Tenderer  when  the  vectorized  code  is  completed.  As  there  is  no  communication 
in  the  rendering  step,  one  might  expect  linear  speedup  when  utilizing  more  processors.  As  can  be  seen 


Figure  6:  Illustration  of  the  Parallel  Image  Compositing  Process  using  the  Vessel  Data  Set.  Each  column 
shows  the  images  from  one  processor,  while  the  rows  are  the  phases  of  the  compositing  algorithm.  The 
final  image  is  displayed  at  the  bottom. 
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from  the  three  speedup  graphs,  this  is  not  always  the  case  due  to  the  load  balance  problems.  The 
vorticity  data  set  is  relatively  dense  (i.e.  it  contains  few  empty  voxels)  and  therefore  exhibits  nearly 
linear  speedup.  On  the  other  hand,  both  the  head  and  the  vessel  data  sets  contain  many  empty  voxels 
which  unbalance  the  load  and  therefore  do  not  exhibit  the  best  speedup.  Figure  12  demonstrates  that 
for  the  vorticity  data  set,  our  implementation  achieves  very  good  speedup  for  all  image  sizes  except 
64x64.  The  rendering  of  the  64x64  image  exhibits  less  speedup  than  larger  image  sizes  due  to  overhead 
costs  associated  with  the  rendering  and  compositing  steps.  In  particular,  the  compositing  step  showed 
a  speedup  of  only  1.46  when  going  from  32  nodes  to  512  nodes.  For  all  image  resolutions  above  64x64, 
the  overall  speedup  was  nearly  the  same. 

The  compositing  stage  requires  communication  between  pairs  of  nodes  to  perform  the  actual  com¬ 
positing.  In  the  case  of  the  vorticity  data  set,  Figure  13  shows  the  compositing  time  in  solid  lines  and 
the  compositing  time  with  the  communication  time  removed  in  dotted  lines  for  different  sized  CM-5 
partitions.  The  communication  time  varied  from  about  10  percent  to  about  3  percent  of  the  total  com¬ 
positing  time.  As  the  image  size  increases,  both  the  compositing  time  and  the  communication  time  also 
increase.  For  a  fixed  image  size,  increasing  the  partition  size  lowers  the  communication  time  because 
each  node  contains  a  proportionally  smaller  piece  of  the  image. 

We  also  measured  the  data  distribution  time  and  image  gathering  time.  The  data  distribution  time 
includes  the  time  it  takes  to  read  the  data  over  NFS  at  Ethernet  speeds  on  a  loaded  Ethernet.  The 
image  gathering  time  is  the  time  it  takes  for  the  nodes  to  send  their  composited  image  tiles  to  the  host. 
While  other  partitions  were  also  running,  the  data  distribution  time  could  vary  dramatically  due  to 
the  disk  and  Ethernet  ontention.  Taking  the  vorticity  data  set  as  an  example,  the  distribution  data 
varied  from  40  to  90  seconds  regardless  of  the  partition  size.  For  the  above  tests,  a  64x64  image  can  be 
gathered  within  0.01  seconds  and  a  512x512  image  within  1  second.  For  the  same  image  size,  the  image 
gathering  time  is  only  slightly  slower  for  larger  partitions  which  have  more  image-tiles.  Both  of  the  data 
distribution  time  and  image  gathering  time  will  be  mitigated  by  use  of  the  parallel  storage  and  the  use 
of  the  HIPPI  frame  buffer. 

5.2  Networked  Workstations 

For  our  workstation  tests,  we  used  a  set  of  32  high  performance  workstations.  The  first  four  machines 
were  IBM  RS/6000-550  workstations  equipped  with  512  MB  of  memory.  These  workstations  are  rated 
at  81.8  SPECfp92.  The  next  12  machines  were  HP9000/730  workstations,  some  with  32  MB  and  others 
with  64  MB.  These  machines  are  rated  at  86.7  SPECfp92.  The  remaining  16  machines  were  Sun  Sparc- 
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Figure  10:  CM~5  Speedup  for  the  Vessel  Data  Set. 


Figure  13:  CM-5  Compositing/Communication  Times  for  the  Vorticity  Data  Set. 

10/30  workstations  equipped  with  32  MB,  which  are  rated  at  45  SPECfp92.  The  tests  on  one,  two  and 
four  workstations  used  only  the  IBM’s  because  of  their  memory  capacity.  The  tests  with  eight  and  16 
used  a  combination  of  the  HP’s  and  IBM’s.  The  16  Sun’s  were  used  for  the  tests  on  32.  It  was  not 
possible  to  assure  absolute  quiescence  on  each  machine  because  they  are  in  a  shared  environment  with 
a  large  shared  Ethernet  and  files  systems.  During  the  period  of  testing  there  was  network  traffic  from 
network  file  system  activity  and  across-the-net  tape  backups.  In  addition,  a  few  workstations  lie  on 
different  subnets.  Thus  the  communication  performance  was  little  difficult  to  characterize. 

The  actual  numbers  of  the  test  results  are  provided  in  Appendix  B.  Wc  display  the  same  information 
graphically  in  Figure  14,  15  and  16  for  the  head,  vessel  and  vorticity  data  sets,  respectively.  In  a 
heterogeneous  environment,  it  is  less  meaningful  to  use  speedup  graphs  to  study  the  performance  of  our 
algorithm  and  implementation  so  speedup  graphs  are  not  provided. 

From  the  test  results,  we  see  that  the  rendering  time  still  dominates  when  using  eight  or  fewer 
workstations.  It  is  also  less  beneficial  to  render  smaller  images  due  to  the  overhead  costs  associated 
with  the  rendering  and  compositing  steps.  Unlike  the  CM-5’s  results,  tests  on  workstations  show  that 
the  communication  component  is  the  dominant  factor  in  the  compositing  costs.  On  the  average,  as 
shown  in  Figure  17,  communication  takes  about  97%  of  the  overall  compositing  time.  This  can  be  seen 
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by  comparing  the  solid  lines  with  the  dotted  lines  in  the  graph.  For  the  CM-5,  a  large  partition  improves 
the  overall  communications  time  because  not  only  each  node  then  contains  a  proportionally  smaller  piece 
of  the  image  but  also  the  network  bandwidth  scales  with  the  partition  size  on  the  CM5.  However,  this 
is  not  true  for  the  networked  workstations  because  the  ethernet  bandwidth  used  with  PVM  is  fixed. 

For  large  images  (e.g.  512x512),  considering  the  Ethernet  speed,  it  seems  worthwhile  to  compress  the 
subimages  used  in  the  swapping  process.  We  have  incorporated  a  compression  program  into  the  Tenderer 
by  using  an  algorithm  described  in  [23].  In  Figure  18,  results  for  the  vorticity  data  set  at  an  image  size  of 
512x512  are  shown.  Apparently,  compressing  the  subimages  before  swaps  helps  reduce  the  compositing 
cost  significantly,  especially  when  more  workstations  were  used.  For  example,  in  the  32-workstation 
case,  the  total  compositing  time  for  one  workstation  was  about  1.465  seconds  including  0.042  seconds 
for  the  raw  compositing,  0.088  seconds  for  the  compression  and  1.271  seconds  for  the  communication. 
The  compression  ratio  was  about  four  to  one,  resulting  in  about  80%  faster  communication  rates.  When 
rendering  time  dominates  due  to  either  the  use  of  slower  processors  or  fewer  processors,  the  compression 
option  is  not  as  worthwhile. 

The  data  distribution  and  image  gather  times  also  varied  greatly,  due  to  the  variable  load  on  the 
shared  Ethernet.  The  data  distribution  times  varied  from  17  seconds  to  150  seconds  while  the  image 
gather  times  varied  from  an  average  of  .06  seconds  for  a  64  x64  image  to  a  high  of  8  seconds  for  a  5 1 2  x 5 1 2 
image.  The  above  test  results  were  based  on  Version  3.1  of  PVM.  Our  initial  tests  using  PVM  2.4.2 
show  a  much  higher  communication  cost,  more  than  70%  higher,  as  reported  in  [11]. 

In  a  shared  computing  environment,  the  communication  costs  are  highly  variable  due  to  the  use  of  the 
local  Ethernet  shared  with  hundreds  of  other  machines.  There  are  many  factors  that  we  have  no  control 
over  that  are  influential  to  our  algorithm.  For  example,  an  overloaded  network  and  other  users'  processes 
competing  with  our  rendering  process  for  CPU  and  memory  usage  could  greatly  degrade  the  performance 
of  our  algorithm.  Improved  performance  could  be  achieved  by  carefully  distributing  the  load  to  each 
computer  according  to  data  content,  and  the  computer’s  performance  as  well  as  its  average  usage  by 
other  users.  Moreover,  communications  costs  are  expected  to  drop  with  higher  speed  interconnection 
networks  (e.g.  FDDI)  and  on  clusters  isolated  from  the  larger  local  area  network. 

6  Conclusions 


We  have  presented  a  parallel  volume  ray-tracing  algorithm  for  a  massively  parallel  computer  or  a  set  of 
interconnected  workstations.  The  algorithm  divides  both  the  computation  and  memory  load  across  all 
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Figure  16:  PVM  Results  on  the  Vorticity  Data  Set. 


Figure  17:  PVM  Compositing/ Communication  Times  for  the  Vorticity  Data  Set 


Figure  18:  Comparing  the  Compositing  Time  with  and  without  Compression  using  the  Vorticity  Data 
Set. 

processors  and  can  therefore  be  used  to  render  data  sets  that  are  too  large  to  fit  into  the  memory  system 
of  a  single  uniprocessor.  A  parallel  ( binary-swap )  compositing  method  was  developed  to  combine  the 
independently  rendered  results  from  each  processor.  The  binary-swap  compositing  method  has  merits 
which  make  it  particularly  suitable  for  massively  parallel  processing.  First,  while  the  parallel  compositing 
proceeds,  the  decreasing  image  size  for  sending  and  compositing  makes  the  overall  compositing  process 
very  efficient.  Next,  this  method  always  keeps  all  processors  busy  doing  useful  work.  Finally,  it  is  simple 
to  implement  with  the  use  of  the  k-D  tree  structure  described  earlier. 

The  algorithm  has  been  implemented  on  both  the  CM-5  and  a  network  of  scientific  workstations.  The 
CM-5  implementation  showed  good  speedup  characteristics  out  to  the  largest  available  partition  size  of 
512  nodes.  Only  a  small  fraction  of  the  total  rendering  time  was  spent  in  communications,  indicated  the 
success  of  the  parallel  compositing  method.  Several  directions  appear  ripe  for  further  work.  The  host 
data  distribution,  image  gather,  and  display  times  are  bottlenecks  on  the  current  CM-5  implementation. 
These  bottlenecks  can  be  alleviated  by  exploiting  the  parallel  1/0  capabilities  of  the  CM-5.  Rendering 
and  compositing  times  on  the  CM-5  can  also  be  reduced  significantly  by  taking  advantage  of  the  vector 
units  available  at  each  processing  node.  We  are  hopeful  that  real  time  rendering  rates  will  be  achievable 
at  medium  to  high  resolution  with  these  improvements. 
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Performance  of  the  distributed  workstation  implementation  could  be  further  improved  by  better  load 
balancing.  In  a  heterogeneous  environment  with  shared  workstations,  linear  speedup  is  difficult.  A 
simple  approach  is  to  do  static  load  balancing.  The  data  subdivision  can  be  done  unevenly,  taking  into 
account  the  predicted  capacity  on  each  machine  to  try  to  balance  the  load.  Alternatively,  the  data  can 
be  subdivided  into  a  larger  number  of  equal  sized  subvolumes  and  the  more  capable  machines  can  be 
assigned  more  than  one  subvolume.  The  later  approach  has  the  advantage  that  it  can  be  generalized  to  a 
dynamic  load  balancing  approach:  divide  the  data  into  many  subvolumes  and  assign  them  to  processors 
in  a  demand  driven  fashion.  The  finer  subdivision  of  the  data  volumes  would  improve  load  balancing 
during  rendering  at  the  cost  of  some  additional  compositing  time  due  to  more  levels  in  the  compositing 
tree. 
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Appendix  A:  CM-5’s  Test  Results 
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Table  1:  CM-5  Results  on  Head  Data  Set 
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Table  2:  CM-5  Results  on  Vessel  Data  Set 


Table  3:  CM-5  Results  on  Vorticity  Data  Set 
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Table  4:  CM-5  Compositing  Communication  Times  for  the  Vorticity  Data  Set 
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Appendix  B:  Workstations’  Test  Results 
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Table  5:  PVM  Results  on  the  Head  Data  Set 
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Table  6:  PVM  Results  on  the  Vessel  Data  Set 
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Table  7:  PVM  Results  on  the  Vorticity  Data  Set 
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Table  8:  PVM  Compositing  Communication  Times  on  the  Vorticity  Data  Set 
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